Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning
As a pivotal component of generalizable human intelligence, reasoning holds
great potential for helping reinforcement learning (RL) agents generalize
across varied goals by summarizing part-to-whole arguments and discovering
cause-and-effect relations. However, how to discover and represent causal
relations remains an open problem that hinders the development of causal RL.
In this paper, we augment Goal-Conditioned RL (GCRL) with a Causal Graph (CG),
a structure built upon the relations between objects and events. We give a
novel formulation of the GCRL problem as variational likelihood maximization
with the CG as a latent variable. To optimize the derived objective, we
propose a framework with theoretical performance guarantees that alternates
between two steps: using interventional data to estimate the posterior of the
CG, and using the CG to learn generalizable models and interpretable
policies. Due to the lack of
public benchmarks that verify generalization capability under reasoning, we
design nine tasks and then empirically show the effectiveness of the proposed
method against five baselines on these tasks. Further theoretical analysis
shows that our performance improvement is attributed to the virtuous cycle of
causal discovery, transition modeling, and policy training, which aligns with
the experimental evidence in extensive ablation studies.
Comment: 28 pages, 5 figures, under review.
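To make the alternating scheme concrete, the following toy sketch refreshes a belief over CG edges from simulated interventional data and then refits a transition model masked by the inferred graph. Everything in it (the correlation heuristic standing in for the variational posterior update, the linear dynamics, and all names) is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars = 4

# Ground-truth causal graph, used only to simulate interventional data.
true_adj = np.triu(rng.random((n_vars, n_vars)) < 0.5, k=1).astype(float)

def intervene(var, n=256):
    """Simulate one-step transitions while forcing variable `var` to random values."""
    s = rng.normal(size=(n, n_vars))
    s[:, var] = 3.0 * rng.normal(size=n)          # intervention: randomize one variable
    s_next = s @ true_adj + 0.1 * rng.normal(size=(n, n_vars))
    return s, s_next

edge_prob = np.full((n_vars, n_vars), 0.5)        # belief over CG edges

for step in range(3):
    # Step 1: refresh the edge belief from interventional data
    # (a correlation heuristic in place of the variational update).
    for i in range(n_vars):
        s, s_next = intervene(i)
        for j in range(n_vars):
            r = abs(np.corrcoef(s[:, i], s_next[:, j])[0, 1])
            edge_prob[i, j] = 0.5 * edge_prob[i, j] + 0.5 * r
    # Step 2: refit a linear transition model masked by the inferred graph.
    mask = (edge_prob > 0.5).astype(float)
    s, s_next = intervene(int(rng.integers(n_vars)))
    W, *_ = np.linalg.lstsq(s, s_next, rcond=None)
    W *= mask                                      # keep only causally supported weights
    print(f"iter {step}: inferred edges\n{mask.astype(int)}")
```

Interventions break the spurious correlations that confound purely observational data, which is why each posterior refresh here conditions on forced values of a single variable.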
Deformable Kernel Expansion Model for Efficient Arbitrary-shaped Scene Text Detection
Scene text detection is a challenging computer vision task due to the high
variation in text shapes and ratios. In this work, we propose a scene text
detector named Deformable Kernel Expansion (DKE), which incorporates the merits
of both segmentation-based and contour-based detectors. DKE employs a
segmentation module to segment the shrunken text region as the text kernel,
and then expands the text kernel contour to obtain the text boundary by
regressing vertex-wise offsets. Generating the text kernel by segmentation
enables DKE to inherit the arbitrary-shaped text region modeling capability of
segmentation-based detectors. Regressing the kernel contour with a set of
sampled vertices enables DKE to avoid complicated pixel-level post-processing
and to better learn contour deformation, as contour-based detectors do.
Moreover, we propose an Optimal
Bipartite Graph Matching Loss (OBGML) that measures the matching error between
the predicted contour and the ground truth, which efficiently minimizes the
global contour matching distance. Extensive experiments on CTW1500, Total-Text,
MSRA-TD500, and ICDAR2015 demonstrate that DKE achieves a good tradeoff between
accuracy and efficiency in scene text detection.
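As a sketch of the matching idea behind OBGML (an assumed reconstruction, not the authors' code), predicted contour vertices can be assigned one-to-one to ground-truth vertices with the Hungarian algorithm, so the summed vertex distance used as the loss is globally minimal rather than dependent on vertex ordering:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def obgm_loss(pred, gt):
    """pred, gt: (N, 2) arrays of contour vertices."""
    # Pairwise L2 cost between every predicted and every ground-truth vertex.
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # optimal bipartite matching
    return cost[rows, cols].mean()

pred = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
gt   = np.array([[0.1, 0.9], [0.0, 0.1], [1.1, 0.1], [0.9, 1.0]])
print(obgm_loss(pred, gt))   # small: vertices pair up regardless of their order
```

Because the assignment is optimal over the whole contour, the loss stays low even when the predicted vertices are a permutation of the ground truth.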
Semantically Controllable Generation of Physical Scenes with Explicit Knowledge
Deep Generative Models (DGMs) are known for their superior capability in
generating realistic data. Extending purely data-driven approaches, recent
specialized DGMs may satisfy additional controllable requirements such as
embedding a traffic sign in a driving scene, by manipulating patterns
\textit{implicitly} at the neuron or feature level. In this paper, we introduce
a novel method to incorporate domain knowledge \textit{explicitly} in the
generation process to achieve semantically controllable scene generation. We
categorize our knowledge into two types to be consistent with the composition
of natural scenes, where the first type represents the property of objects and
the second type represents the relationships among objects. We then propose a
tree-structured generative model to learn complex scene representations, whose
nodes and edges naturally correspond to these two types of knowledge,
respectively. Knowledge can be explicitly integrated to enable semantically
controllable scene generation by imposing semantic rules on properties of nodes
and edges in the tree structure. We construct a synthetic example to illustrate
the controllability and explainability of our method in a clean setting. We
further extend the synthetic example to realistic autonomous vehicle driving
environments and conduct extensive experiments to show that our method
efficiently identifies adversarial traffic scenes, satisfying the traffic
rules specified as explicit knowledge, against different state-of-the-art 3D
point cloud segmentation models.
Comment: 14 pages, 6 figures. Under review.
Functional Mapping of Plant Growth in Arabidopsis thaliana
Most traits important to agriculture, biology, and biomedicine are complex traits, determined by both genetic and environmental factors. Complex traits whose phenotypes change over different stages of development are called dynamic traits. Traditional quantitative trait locus (QTL) mapping approaches ignore these dynamic changes. Functional mapping, a powerful statistical tool, can not only map QTLs that control the developmental pattern and process of complex traits, but also describe their dynamic changes. In this study, we used functional mapping to identify QTLs that affect height growth in a 10th-generation recombinant inbred line population derived from two different Arabidopsis thaliana accessions. Functional mapping identified 48 QTLs related to height traits, and the growth curves of different genotypes can be drawn for each significant locus. Using GO functional annotations, we found that the detected QTLs are associated with the synthesis of biological macromolecules and the regulation of biological functions. Our findings provide unique insights into the genetic control of height growth in A. thaliana and a theoretical basis for the study of complex traits.
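As a toy illustration of the functional-mapping idea (with fabricated data, not the study's measurements), each genotype at a candidate QTL is given its own logistic growth curve, and the QTL signal is how much the fitted curves differ between genotypes:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, a, b, r):
    """Logistic growth curve commonly used in functional mapping."""
    return a / (1.0 + b * np.exp(-r * t))

t = np.linspace(1, 20, 10)                      # measurement time points
rng = np.random.default_rng(1)
# Simulated height trajectories for two QTL genotypes (AA vs. aa).
h_AA = logistic(t, 30, 8, 0.45) + rng.normal(0, 0.5, t.size)
h_aa = logistic(t, 24, 8, 0.35) + rng.normal(0, 0.5, t.size)

for name, h in [("AA", h_AA), ("aa", h_aa)]:
    (a, b, r), _ = curve_fit(logistic, t, h, p0=(25, 5, 0.4))
    print(f"genotype {name}: asymptote={a:.1f}, growth rate={r:.2f}")
```

A large gap between the genotype-specific curve parameters at a locus is what flags that locus as a QTL for the dynamic trait.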
Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Most existing text-video retrieval methods focus on cross-modal matching
between the visual content of videos and textual query sentences. However, in
real-world scenarios, online videos are often accompanied by relevant text
information such as titles, tags, and even subtitles, which can be utilized to
match textual queries. This insight has motivated us to propose a novel
approach to text-video retrieval, where we directly generate associated
captions from videos using zero-shot video captioning with knowledge from
web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated
captions, a natural question arises: what benefits do they bring to text-video
retrieval? To answer this, we introduce Cap4Video, a new framework that
leverages captions in three ways: i) Input data: video-caption pairs can
augment the training data. ii) Intermediate feature interaction: we perform
cross-modal feature interaction between the video and caption to produce
enhanced video representations. iii) Output score: the Query-Caption matching
branch can complement the original Query-Video matching branch for text-video
retrieval. We conduct comprehensive ablation studies to demonstrate the
effectiveness of our approach. Without any post-processing, Cap4Video achieves
state-of-the-art performance on four standard text-video retrieval benchmarks:
MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is
available at https://github.com/whwu95/Cap4Video.
Comment: Accepted by CVPR 2023. Selected as a Highlight (Top 2.5% of ALL
submissions).
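A minimal sketch of the output-level use of captions (way iii above): the final retrieval score blends the Query-Video and Query-Caption similarities. The fusion weight and the embedding setup below are assumptions for illustration, not values from the paper:

```python
import numpy as np

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 512))       # text query embeddings (e.g., from CLIP)
v = rng.normal(size=(5, 512))       # video embeddings
c = rng.normal(size=(5, 512))       # embeddings of the generated captions

alpha = 0.8                         # assumed weight on the Query-Video branch
score = alpha * cosine_sim(q, v) + (1 - alpha) * cosine_sim(q, c)
print(score.argmax(axis=1))         # top-1 retrieved video per query
```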